This paper describes the measurement and analysis
of  permanent  CPU  related  errors  and  system  activity 
at  the  Stanford  Linear  Accelerator  Center  computa¬ 
tion  facility.  Between  13  and  18  percent  of  all 
errors  affecting  the  CPU  were  estimated  to  be  per¬ 
manent.  The  manifestation  of  a  permanent  error  was 
found  to  be  strongly  correlated  with  the  level  and 
type  of  workload  prior  to  the  manifestation  of  the 
error.  For  example,  it  is  shown  that  the  risk  of  a 
permanent  error  increases  in  a  non-linear  fashion 
with  the  amount  of  interactive  processing.  The 
observed  tendency  is  present  in  three  years  of  load 
data.  This  observation  is  significant  because  a 
load-error  relationship  found  at  the  CPU  level 
must,  in  our  view,  be  considered  fundamental.  In 
addition,  in  a  majority  of  the  observed  errors,  the 
latency  between  the  occurrence  and  the  manifesta¬ 
tion  of  the  error  was  estimated  to  be  insignificant 
for  the  purposes  of  our  analysis.  Thus  the  detec¬ 
tion  of  the  error  also  provides  an  estimate  of  the 
occurrence  of  the  error. 

Keywords: Workload and error measurement, data
analysis,  statistical  models. 
The highly interactive and diverse nature of modern-day
systems has made high reliability a central issue in
computer system design. It is not, in general, feasible
to guarantee a perfect system, either in hardware or in
software. Accordingly, depending on the nature of the
application, it is important to design into the system
the ability either to continue operation in the event of
a failure or to react to a failure in a predictable
manner.

Theoretical  models  can  only  deal  with  a 
restricted  class  of  problems.  Most  often  it  is  the 
problems  outside  the  range  of  theoretical  models 
which cause the most severe malfunctions. Accordingly,
at this stage there is no substitute for results based
on actual measurements and experimentation. An
experimental study provides not
only a view of the end product but also gives some
insight  into  persistent  problems.  This  information 
can  be  very  valuable  in  designing  new  systems. 

This paper describes the measurement and analysis of
permanent CPU related errors and system activity at the
Stanford Linear Accelerator Center (SLAC) computation
facility. The authors' approach has been to start with a
substantial body of empirical data on system load and
errors. The measurement process is automatic; it captures
a detailed internal view of the system, especially under
error conditions. From the measurements, a completely new
data base of errors and workload was established in order
to match errors with workloads at the times of error. On
the basis of these measurements several experiments were
conducted to examine the dependence of all CPU related
errors on system activity. A CPU related error is defined
as one which affects the normal operation of the CPU; it
could originate in the CPU itself, in the main memory or
in a channel. The present study concentrates on permanent
CPU related errors. Between 13 and 18 percent of all CPU
errors were estimated to be permanent.¹

The  measurements  and  statistical  experiments 
clearly  demonstrate  a  non-linear  increase  in  the 
risk  of  observing  permanent  CPU  related  errors  due 
to increased values of workload variables. Examples
are CPU utilization, input/output rate, and
interrupt rates.

A representative measurement is illustrated in
Fig. 1, which shows how an increase in the system
CPU usage, SYSCPU (a measure of the system overhead;
a fraction between 0 and 1), can result in
higher risk of permanent errors in the CPU and main
memory. The horizontal axis is the workload variable;
the vertical axis is the risk of error. Modeling
details will be given later in this paper.
We estimate that in a majority of the observed
errors  the  latency  between  the  error  occurrence  and 
its  manifestation  was  insignificant  in  comparison 
with  the  time  required  to  produce  a  measurable 
change  in  the  average  workload  values  used  in  the 
analysis.  Thus,  as  far  as  the  measured  workload 
values  are  concerned,  the  manifestation  of  a  perma¬ 
nent  error  almost  coincides  with  its  occurrence. 

Related Research and Motivation

There  is  now  considerable  experimental  evidence  to 
show that computer reliability is a dynamic function
of system activity (as measured by the workload).
A number of studies ([Butner 80], [Iyer 82a, b] and
[Castillo 80, 81]) provide statistical
evidence on a number of machines to support this
observation. Even though the exact nature of this
dependency is not fully understood, it would appear
that computing systems, which need maximum


1 Between 75 and 85 percent of all errors were
temporary (transient or intermittent) and are
discussed in [Iyer 82b] and [Rossetti 81].


Subsequently a more detailed and accurate study
was performed on all CPU errors [Iyer 82b]. It was
found that all errors affecting the CPU correlated
strongly with system activity; however, the large
majority of these errors (75 to 85 percent) were
temporary. More recent studies conducted on the
IBM 3081 at SLAC found a similar behavior with
software related failures on VM/370 [Rossetti 82].
Additional substantiation of these results came
from experimental studies on DEC systems reported
in [Castillo 80].
Figure 1: Risk of error vs. system CPU usage
(fraction).


reliability  at  their  peak  load,  require  a  re-evalu¬ 
ation  of  their  reliability  projections. 

An important, and as yet unanswered, question is
whether an increased level of system activity
results in an increased level of hardware failures.
In particular, it is important to determine whether
permanent  failures  in  logic  elements  (CPU  and  stor¬ 
age)  are  also  workload  dependent  i.e.,  whether 
higher  system  activity  results  in  a  higher  level  of 
CPU  and  memory  failures. 

Some  evidence  to  this  effect  was  available  from 
an  early  analysis  of  failures  on  the  SLAC  Triplex 
[Iyer 82a]. The study found a strong correlation
between the occurrence of hardware failures and the
load on the system, as measured by variables such
as the paging rate and the jobstep processing rate.
All failures were considered, not simply the ones
which led to system service interruptions. Most
importantly, the effects were such that the average
failure rate for various system components varied
cyclically over a band of significant width as
determined by the daily load variations. Fig. 2 is a
representative histogram, from that study, of all
permanent  CPU  failures  plotted  by  the  hour  of  day, 
averaged  over  1978. 

In  a  majority  of  the  cases,  the  time  between  the 
occurrence  of  a  failure  and  its  manifestation  was 
estimated to be insignificant. This also matches
the observations and experimental results
reported in [Lala 83].*


1  The  study  reports  extremely  small  latency  times 
(less  than  1  second)  for  detectable  faults.  Less 
than  20  percent  of  injected  faults  were  not 
detected  and  a  vast  majority  of  these  were  due  to 
unused  gates  or  on  signals  which  were  always  low 
or  high. 
Figure 2: CPU failures by hour of day (SLAC
Triplex).


There  has  been  some  effort  at  modelling  the 
observed  load/failure  relationship.  Possible 
cause-effect  scenarios  are  discussed  in  [Butner  80] 
and [Iyer 82a]. Castillo and Siewiorek [Castillo
81, 82] have proposed the use of a doubly-stochastic
Poisson process to model the cyclic load-failure
relationship. The model assumes that the
instantaneous failure rate can be described by a
cyclostationary Gaussian process. In [Gunther 80] a
novel theoretical model for an apparent dependency of
failures on load, based on a random walk formulation,
is described. There is no doubt that more
detailed experimental results are necessary before
a clear understanding of the observed behavior is
possible.


The  next  section  discusses  the  error  and  work¬ 
load  measurements  taken  and  briefly  presents  the 
organization  of  the  data.  Subsequent  sections 
describe the procedures employed to analyze permanent
errors and present new results. Finally, the
paper summarizes the important results and highlights
the conclusions that can be drawn from them.
As stated earlier, the present study uses the most
detailed  data  from  the  log  maintained  by  the  oper¬ 
ating  system  as  errors  are  detected  by  the  hardware 
and  recorded  by  the  software.  High  level  system 
behavior,  as  seen  by  the  computer  operator  and 
users,  is  not  directly  measured.  Instead,  there  is 
much information on hardware errors, both recoverable
and non-recoverable, as they occur in the
detailed operation of system components.

The SLAC system, during the period of our study,
consisted of two IBM 370/168 mainframes and an IBM
360/91 connected in a triplex mode. The data for
our study, which consisted of three years of
measurements (1979, 1980, and 1981), came from the two
IBM 370/168 mainframes. The log referred to above
is commonly called SYS1.LOGREC or the EREP log,
from the Environmental Recording Editing and Printing
program used to accumulate and format it for
maintenance [IBM 79].

Errors in IBM 370 systems are classified into
three major types:

1. CPU Errors - In the central processor and
storage.

2. Channel Errors - In I/O channels and associated
interfaces.

3. Outboard Errors - In any device beyond the
channel-control unit interface, i.e. all
errors in I/O devices.

For each error, whether recoverable or not, the
operating  system  creates  a  time-stamped  record 
describing  the  error  and  providing  relevant  infor¬ 
mation  on  the  state  of  the  machine.  As  an  example, 
for  a  CPU  error,  the  state  information  might 
include  the  contents  of  all  internal  registers  and 
diagnostic  information  collected  by  the  hardware 
(such  as  parity  indicators  and  error  flags).  At 
SLAC this information is collected on a daily basis
and  archived  for  many  years. 

Since the LOGREC data does not specifically provide
information on permanent failures, it is necessary
to estimate this information from the data.
It  was  noticed  that  those  errors  which  commonly 
occurred  in  large  bursts  within  a  short  period  with 
the  same  error  symptom  were  almost  always  due  to  a 
permanent  hardware  failure.  The  vast  majority  of 
these  errors  were  in  main  storage.  The  following 
rule  was  therefore  used  to  estimate  a  hardware 
failure: if the machine-check condition interrupted
the CPU and recurred four times or more in
rapid succession, the error was considered to be a
permanent error in the CPU or main memory.
Discussions with SLAC systems and maintenance people
showed  that  this  policy  corresponded  reasonably 
well  with  their  experience.  A  typical  example  is  a 
permanent  single  bit  failure  in  main  memory.  The 
system  typically  hits  this  location  frequently 
(from a few milliseconds to 10-15 minutes, depending
on the workload), corrects the error and continues
processing.⁵ Each correction results in an
error log entry, producing a cluster of errors with the
same symptom. In many cases it was found that this
caused a system termination. A sample of the hardware
error data obtained on this basis is shown in
Table 1. The number of errors in a cluster
(NPOINTS) and the time span of the cluster (SPAN)
are also included in each record.

TABLE  1 

Sample error data (LOGREC)
Since errors in processors occur fairly infrequently
(on the order of once a day for our measurements),
correlation with workload requires long
term workload figures. The workload data comes
from  two  sources:  the  built-in  system  utilization 
facility,  and  a  software  monitor  written  specifi¬ 
cally  for  this  study.  They  are  discussed  below. 

The operating systems in the processors measured
use  IBM's  System  Management  Facilities  (SMF)  for 
usage  accounting.  SMF  was  originally  designed  to 
provide  accounting  information,  but  it  has  evolved 
over  the  years  to  include  more  general  performance 
measurement information. SMF is discussed exhaustively
elsewhere (see [IBM 73], [Butner 80]) and
will  not  be  detailed  here. 

In general, SMF data consists of records giving
resource utilization figures for jobs, files, I/O
devices, and a potpourri of statistics gathered and
written on a periodic basis. For this work we use
the type 4 (Step) record, which holds statistics
for each job step as it completes execution, and
the type 1 (Wait) record, written roughly every 10
minutes, which summarizes global system utilization
during that 10 minute period. With careful
processing, SMF can provide excellent workload
statistics, especially when high resolution results are
not needed.

To obtain more detailed information about transient
behavior in the CPU we implemented an interrupt
rate monitor, called INTRACK. There are four


5  If  the  error  is  more  serious,  the  system  can 
recover  by  retrying  the  instruction  or  by  abort¬ 
ing  the  current  task. 
classes of interrupts in the IBM 370 architecture:

1. External (EXT) - Used by the operating system
for clocks and inter-CPU communication.

2. Supervisor Call (SVC) - Caused by any SVC
instruction. Used for operating system services
such as memory allocation, synchronisation,
I/O, timing, etc.

3. Program (PROG) - Program traps due to arithmetic
conditions (e.g. division by zero),
invalid operations, or page faults.

4. Input/Output (I/O) - From completion of I/O
operations.

The interrupt monitor (INTRACK) archived the
interrupt data along with the SMF data described
above.  Table  2  summarizes  the  sources  of  data  for 
the  workload  information. 

TABLE 2

Input data for workload variables.

Record       Written                     Contents used
-----------  --------------------------  ------------------------------
Step         At end of each batch job    Accounting and job usage data,
                                         e.g. CPU time, no. of I/Os.
Wait         Roughly every 10 minutes    CPU wait time during preceding
                                         10 minute period.
INTRACK      Normally every 10 minutes   Contents of four cumulative
             (but settable)              interrupt counters for
                                         External, SVC, Program, I/O.

OVERVIEW OF THE MEASUREMENT SYSTEM

An  objective  of  the  measurement  system  was  to  make 
data management as automatic as possible so that it
is  unnecessary  to  know  the  particulars  of  operating 
systems,  software  monitors,  record  formats,  and  the 
like.  The  Statistical  Analysis  System  (hereafter 
called  SAS)  [SAS  79]  provided  a  rich  environment 
for  data  handling,  in  addition  to  its  procedures 
for  statistical  analysis.  Once  a  few  programs  were 
written  to  capture  and  reduce  the  raw  data,  the 
information  was  immediately  built  into  SAS  data 
bases (called SAS data sets), on which the full
power of SAS could be used to sort, select, merge,
and extract information. More than 50 SAS programs,
some very simple, were written to perform a
variety of data handling operations on the data
bases. This section discusses the system as a
whole, describing the flow of data in general


* Machine check interrupts are not considered here
because they are already collected in the LOGREC
data.


terms. Later, important components such as error
clustering and workload smearing are covered in
detail.

The  transformation  of  raw  workload  and  error 
data  into  usable  data  bases  for  analysis  is  per¬ 
formed  by  a  collection  of  programs,  some  written  in 
PL/I  and  many  written  in  SAS.  Refer  to  Figure  3 
for  the  organization  of  these  processors  and  the 
flow of data through them as they are described in
the  following  sections. 


Figure 3: Detailed data flow in the measurement
system.


Processing the Workload Data

Workload processing begins with a program written
to select and condense a specified set of SMF
record  types.  This  program  is  used  to  process  the 
thirty  reels  of  tape  comprising  the  archived  SMF 
data  from  1979  to  the  present. 

Five-minute intervals and smearing.

A  number  of  workload  variables  are  defined  to 
provide  estimates  of  various  characteristics  of 
system  load  throughout  the  three  year  measurement 
period. They are summarized in Table 3. The workload
time granularity was defined to be five minutes,
meaning that for each five-minute period from
January  1979  a  vector  of  13  workload  variables  was 
created.  The  process  described  below  is  applied  to 
each  of  the  variables.  Essentially,  the  process 
takes what information is available in a record and
distributes it into the time slots the record
describes.

TABLE 3

Definitions of workload variables.

Name      Units      Indicates
--------  ---------  ---------------------------
COREQ     KBytes     Batch memory requests
COREU     KBytes     Batch memory usage
VOLWAIT   sec.       Batch I/O wait time
EXCP      1/sec      Batch induced I/O load
PAGEI     1/sec      Batch paging (in)
PAGEO     1/sec      Batch paging (out)
BATCPU    fraction   Batch CPU usage
SYSCPU    fraction   Nonbatch CPU, Ovhd., etc.
TOTCPU    fraction   Overall CPU load
EXT       1/sec      Timer and clock activity
SVC       1/sec      Overall O.S. activity
PROG      1/sec      Paging/prog. exceptions
I/O       1/sec      Overall I/O activity

Each input record provides a starting time, an
ending time, and a value for one or more load measures.
Each of these measures is "smeared" into the
five-minute bins defined by the starting and ending
time of the event, either on a proportional basis
(for variables representing counts or times), or
directly (for "level" variables, such as memory
usage). The algorithm also takes care of the subtle
handling of partial bins at the interval endpoints,
in addition to the case where both endpoints
lie somewhere in the same bin. For these
cases the amount accumulated into the bin is
weighted by the fraction of time spent in the bin.
Figure 4 presents an actual numerical example with
four jobs overlapping in various ways. Notice that
the height of each bin is the sum of the time-averaged
values of the input values entering that bin.
This averaging is similar to approximations that
occur in numerical integration problems.
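
As an illustration only, the smearing step described above can be sketched in Python as follows. This is not the authors' PL/I or SAS code; the function name `smear`, the record layout `(start, end, value)` with times in seconds, and the exact weighting used for level variables are assumptions made for this sketch.

```python
from collections import defaultdict

BIN_SECONDS = 300  # five-minute bins, as in the text


def smear(records, level=False, bin_seconds=BIN_SECONDS):
    """Distribute (start, end, value) records over fixed-width time bins.

    Hypothetical helper. Count/time variables are split in proportion
    to the share of the record's duration that falls in each bin.
    Level variables (e.g. memory in use) contribute their value
    weighted by the fraction of the bin that the record covers.
    """
    bins = defaultdict(float)
    for start, end, value in records:
        if end <= start:
            continue  # ignore empty or malformed intervals
        duration = end - start
        b = int(start // bin_seconds)
        while b * bin_seconds < end:
            lo = max(start, b * bin_seconds)
            hi = min(end, (b + 1) * bin_seconds)
            overlap = hi - lo  # seconds of this record inside bin b
            if level:
                bins[b] += value * overlap / bin_seconds
            else:
                bins[b] += value * overlap / duration
            b += 1
    return dict(bins)


# A job running from t=150s to t=450s that issued 30 I/Os.
io_bins = smear([(150, 450, 30.0)])
# The same job holding 100 KBytes of memory for its whole duration.
mem_bins = smear([(150, 450, 100.0)], level=True)
```

Partial bins at the endpoints and the case where both endpoints fall in one bin are handled by the same overlap computation, so no special cases are needed.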

As stated earlier, the smearing is done one
month at a time, with approximately 8640 bins per
month, depending on the number of days in the
month. Finally, the estimates are concatenated
into one-year groups to form the "Five-Minute
Smeared Data." For example, a complete day of
smeared points (the 288 five-minute bins)
for two variables is given in Figure 5. Each small
step in the figure is a five-minute average; the
solid upper line represents percent CPU busy, the
dotted lower line is batch CPU. The plot shows the
familiar early morning lull between 5 and 8 am with
a dramatic climb to full utilization at about 10
am. Notice that in the evening, from about 10
o'clock on, batch work forms most of the CPU load,
while during the day it is only in the 35 to 40%
range with the remainder going to timesharing and
overhead. It is also interesting to note that at a
Figure 4: Example of Smearing Algorithm


[Plot: January 5, 1981, smeared batch and total CPU.
Horizontal axis: time of day (hours). Dotted line:
smeared batch CPU time (percent). Solid line: smeared
total CPU time (percent).]

Figure 5: One Day of Batch CPU/Total CPU Data


few rare points, batch CPU seems to be greater than
the total. This is due to the averaging algorithm's
smearing of a job's CPU usage evenly over
the job's duration while the total CPU figure is
derived from a 10-minute global system total.

To study longer-range loading effects we also
built a data base of one-hour smeared workload
vectors. Each one-hour point is derived from the
five-minute smeared data by averaging the twelve
five-minute points in that hour and tagging the new
point with the starting time of that hour. There
are 8760 such vectors in a non-leap year. Another
reason for creating the one-hour data is to test
whether system crashes occurring soon after CPU
errors cause the five-minute averages in the period
preceding the error to be artificially decreased.
This could happen because jobs executing at the
time of the crash would not contribute to the
smeared totals as they should. A preliminary
analysis showed this not to be a problem.
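
The derivation of one-hour points from the five-minute data can be sketched as below. This is a minimal illustration, assuming the five-minute values for one variable are held in a list that starts on an hour boundary; the function name `hourly_average` is hypothetical.

```python
def hourly_average(five_minute_values, bin_seconds=300):
    """Average each run of twelve five-minute points into one hourly point.

    Returns (hour_start_seconds, mean_value) pairs; each point is tagged
    with the starting time of its hour. A trailing partial hour is dropped.
    """
    per_hour = 3600 // bin_seconds  # 12 five-minute bins per hour
    hours = []
    for i in range(0, len(five_minute_values) - per_hour + 1, per_hour):
        chunk = five_minute_values[i:i + per_hour]
        hours.append((i * bin_seconds, sum(chunk) / per_hour))
    return hours


# Two hours of data: a quiet hour followed by a busier one.
points = hourly_average([0.25] * 12 + [0.75] * 12)
```

A non-leap year of five-minute data (365 x 288 bins) reduces to 365 x 24 = 8760 hourly vectors, in agreement with the count given in the text.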

Processing  the  Error  Data  (BUILD) 

This  section  presents  the  method  used  to  process 
raw  errors  into  the  data  base  used  for  analysis.  A 
SAS program, called BUILD, performs the following
steps:

(i)  Select:  The  raw  LOGREC  data  includes  CPU, 
channel,  and  device  errors  for  all  equipment  in  the 
installation. Only CPU (Machine Check) errors on
the  two  370/168s  are  selected  for  analysis. 

(ii) Decode and Classify: In each MCH record there
are a number of bits describing the type of error,
its severity, and the result of hardware and software
attempts to recover from the problem. These
bits are decoded into classes meaningful to this
analysis and analyzed in later processing. (General
machine check status indicators provided by the
hardware are described fully in the System/370
Principles of Operation [IBM 81].)

(iii) Sort by Processor and Time: To facilitate
clustering in the next step it is necessary to sort
the data by CPU id (serial number) and time of
error within CPU id.

(iv) Cluster: Errors occurring within 5 minutes of
each other were coalesced. For each error point,
the following test was performed:


them  together.  Notice  the  difference  in  the 
clustering  statistics  (SPAN  and  NPOINTS)  for  perma¬ 
nent  errors  in  comparison  to  those  for  all  errors. 
Since permanent errors are defined by repeated
identical errors, the clusters are larger.
Clustering is important in the error analysis to avoid
biasing  the  results  with  repeated  errors  from  the 
same  failing  component. 

TABLE 4

Error and Cluster Statistics

Error Statistics

Period of Study:                Jan. 1979 - Dec. 1981
All CPU Errors:                 507
Permanent CPU Errors:           85 (16.7% of total)
Mean Time Betw. Perm. Errors:   289.8 Hours

Cluster Statistics (ALL)

                   NPOINTS   SPAN (seconds)
Mean                   4.2        20.4
Median                 1.0         0.0
90th Percentile        5.0        48.6

Cluster Statistics (Permanent)

                   NPOINTS   SPAN (seconds)
Mean                  21.9       175.6
Median                 9.0        39.0
90th Percentile       59.6       505.0


Combining Workload and Error Data (MATCH)


IF   (error type) = (type of previous error)
AND  (time away from previous error) <= 5 minutes
THEN (fold error into cluster being built)
ELSE (start a new cluster).

The result is a set of clustered errors for each
year. Associated with each cluster is information
consisting of error classifications, number of
points  in  the  cluster,  time  of  first  and  last 
errors  in  the  cluster,  and  a  variety  of  status  data 
provided  by  the  hardware  and  operating  system. 
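
The clustering test above, together with the earlier rule that four or more recurrences in rapid succession indicate a permanent error, can be sketched as follows. This is an illustrative reading of the text, not the BUILD program itself; the tuple layout `(cpu_id, time_seconds, error_type)`, the names `Cluster`, `cluster_errors` and `is_permanent`, and the use of cluster size as the "four times or more" test are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    cpu_id: str
    error_type: str
    first_time: float
    last_time: float
    npoints: int = 1

    @property
    def span(self):
        """Seconds between the first and last error in the cluster (SPAN)."""
        return self.last_time - self.first_time


def cluster_errors(errors, window=300.0):
    """Coalesce same-type errors on one CPU that lie within `window` seconds.

    `errors` is an iterable of (cpu_id, time_seconds, error_type) tuples.
    Sorting the tuples orders them by CPU id and then by time within CPU id.
    """
    clusters = []
    current = None
    for cpu_id, t, etype in sorted(errors):
        if (current is not None
                and current.cpu_id == cpu_id
                and current.error_type == etype
                and t - current.last_time <= window):
            current.last_time = t      # fold error into cluster being built
            current.npoints += 1
        else:
            current = Cluster(cpu_id, etype, t, t)  # start a new cluster
            clusters.append(current)
    return clusters


def is_permanent(cluster, min_points=4):
    """Assumed reading of the rule: four or more recurrences in a cluster."""
    return cluster.npoints >= min_points


sample = [("A", 0, "mem"), ("A", 10, "mem"), ("A", 20, "mem"),
          ("A", 30, "mem"), ("A", 1000, "mem"), ("B", 5, "cpu")]
clusters = cluster_errors(sample)
```

In the sample, the four closely spaced memory errors on CPU "A" form one cluster with NPOINTS = 4 and SPAN = 30 seconds, while the later error and the error on CPU "B" remain lone points.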


The  final  and  most  important  step  of  the  data  base 
building  process  is  the  matching  of  errors  and 
workload.  By  matching  we  mean  the  combining  of 
each  error  point  with  information  on  system  work¬ 
load  at  the  time  of  the  error.  The  clustered  error 
points are processed sequentially and for each
point: (1) The time of the five-minute interval
preceding the error is calculated, and (2) used as
a key to locate its corresponding workload
observation. Then (3) the vector of workload variables
from that observation is merged into the error
observation.


Summary error statistics for our data (all
errors and permanent errors) are given in Table 4.
The number of points in a cluster (NPOINTS) and the
time spanned by a cluster (SPAN) are also shown.
The cluster statistics on all errors clearly show
that the clustering algorithm is having an effect
by gathering long bursts of errors into a few large
clusters, indicated by the maximum of 192 points and
1310 second time span. The table also shows that
lone errors predominate, with a median cluster size
of one and time span of zero, showing that the
clustering algorithm is not artificially forcing

In  order  to  determine  the  load  at  the  time  of 
error,  the  5-minute  load  averages  (which  we  refer 
to  as  smeared  averages)  were  merged  with  the  error 
log.  The  load  at  error  was  taken  to  be  the  load  in 
a five minute interval prior to the error to eliminate
perturbations from system error recovery or a
system crash. The matching is shown in Figure 6.
Note that the interval containing the error is not
used because of the measurement distortion that can
be caused by error recovery activities, and the
fact that the system may not continue to run after
the error. Also, the exigencies of a system crash
Figure 6: Merging of Load and Error Data


may  prevent  the  operating  system  from  gathering 
workload  and  accounting  statistics. 

In the case of one-hour averaged workload
measurements, the algorithm is the same except that the
previous hour's load is used.
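
The matching step can be sketched in Python as below. This is only an illustration of the idea, assuming the smeared workload is held in a dictionary keyed by bin index; the name `match_errors_to_load` and the record layout are hypothetical and do not describe the MATCH program itself.

```python
def match_errors_to_load(error_times, workload, bin_seconds=300):
    """Attach to each error the workload vector of the preceding interval.

    `workload` maps a bin index to a dict of workload variables for that
    bin. The bin containing the error is deliberately skipped, since
    error recovery or a crash may distort the measurements in it.
    Passing bin_seconds=3600 gives the one-hour variant.
    """
    matched = []
    for t in error_times:
        error_bin = int(t // bin_seconds)
        key = error_bin - 1                 # interval preceding the error
        vector = workload.get(key)
        if vector is not None:              # skip errors with no prior load data
            matched.append({"error_time": t, "load_bin": key, **vector})
    return matched


load = {0: {"SYSCPU": 0.25, "TOTCPU": 0.5},
        1: {"SYSCPU": 0.5, "TOTCPU": 0.75}}
rows = match_errors_to_load([400.0, 650.0, 100.0], load)
```

The error at t = 100 seconds falls in the first bin and has no preceding interval in this toy data, so it produces no matched observation.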

Summary of the Data Base

Summarizing the above presentations, the following
major sets of data were created:

• Clustered and unclustered "pure" errors - from
which standard error analysis can be drawn to
obtain a number of statistics, e.g. mean time
between errors, hazard with time, etc. See
[Shooman 1964] for more information.

• Three years of workload information - also useful
for studies not necessarily related to reliability.
These points exist in both five-minute and
one-hour granularities.

•  Errors  matched  with  workload  -  in  both  the  five- 
minute  and  one-hour  forms.  These  observations 
can  be  used  to  study  the  connection  between  load 
and  errors  in  large  computer  systems. 

ANALYSIS

Work  Load  and  Error  Analysis 

The data consisted of three years of load/error
measurements: 1979, 1980 and 1981. The 1981 data
contains additional measurements made by our special
purpose interrupt monitor. Initially, we analyzed
each year separately. Since there was no
significant difference in the 1979 and 1980
results, it was considered appropriate to combine
the corresponding load-error data. Of the thirteen
workload measures collected for the study, four
were chosen to be studied for 1979 and 1980. They
were:

1. COREU - The sum of memory allocated by batch
jobs (K bytes).


† A commonplace analogy to illustrate the above
distinction  is  that  automobiles  travelling  at  150 
mph  have  a  higher  probability  of  accident  than 


2. EXCP - The I/O initiation rate by batch
jobs (I/Os per second).

3. SYSCPU - CPU utilization for system, i.e.
non-batch, tasks (a fraction between 0 and 1).

4. TOTCPU - Total CPU usage (a fraction between 0
and 1).

For  1981  the  following  interrupt  measurements  were 
also  included: 

1. SVC - Supervisor calls (rate per second).

2. I/O - I/O interrupts, completion of I/O
operations (rate per second).

3. PROG - Program interrupts (rate per second).

Measures such as SYSCPU and I/O provide a measure
of the system interactive load, while measures
such as TOTCPU provide a general view of the CPU
usage. The variable BATCPU, derived from the
difference between TOTCPU and SYSCPU, is a direct
measure of batch usage.

Recall  that  the  data  base  developed  contains  not 
only  the  values  for  the  specified  workload  vari¬ 
ables  to  a  five  minute  resolution  but  also  the  val¬ 
ues  of  the  same  variables  matched  with  error  times. 
From  this  data  two  types  of  distributions  were 
developed. The first, ℓ(x), is simply the distribution
of the workload variable in question:

    ℓ(x) = Pr(workload = x)

The  second  is  the  joint  distribution  of  an  error 
and  the  workload  measure: 

    f(x) = Pr(error occurs and load = x).

In this expression, errors and load values are represented
as they occur on an actual system, where
favored loads contribute more to the distribution
than loads of low probability. To remove this
effect we divide f(x) by the associated load probability
ℓ(x). Using the well known notion of a
conditional probability distribution [Feller 68] we
write

$$g(x) = \Pr(\text{error occurs} \mid \text{load} = x) = \frac{f(x)}{\ell(x)}$$

Therefore g(x) can be thought of as the probability
of an error at a given load when all loads are
equally represented; it is the conditional error
probability. In the figure g(x) represents the
conditional probabilities arranged by increasing x
(workload). Note that, since each of these probabilities
is calculated independently, g(x) is not
a probability distribution in the regular sense of
the term.†

those travelling at 55 mph. However, there are
far more accidents for autos going 55. To obtain
an accurate representation of the risks involved
in travelling at high speed, we must divide the
number of accidents occurring at each speed by
the number of autos travelling at that speed.
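
The three distributions defined above can be estimated directly from binned counts. The following is a minimal sketch, not the procedure used in the study: the function name, the bin width, and the sample data are hypothetical, and it assumes at most one error per sampling interval and that every error falls in a load bin that was actually sampled.

```python
from collections import Counter


def load_error_distributions(load_samples, error_loads, bin_width):
    """Estimate l(x), f(x) and g(x) = f(x) / l(x) per load bin.

    load_samples: one workload value per sampling interval.
    error_loads:  the workload values matched with error times.
    Returns {bin_start: (l, f, g)} for every bin that was sampled.
    """
    n = len(load_samples)
    load_counts = Counter(int(v // bin_width) for v in load_samples)
    error_counts = Counter(int(v // bin_width) for v in error_loads)

    table = {}
    for b in sorted(load_counts):
        l = load_counts[b] / n      # Pr(workload = x)
        f = error_counts[b] / n     # Pr(error occurs and load = x); 0 if no errors
        g = f / l                   # Pr(error occurs | load = x)
        table[b * bin_width] = (l, f, g)
    return table


# Hypothetical data: the system spends 80% of its time at low load.
samples = [10] * 80 + [60] * 20
errors = [10, 10, 10, 60, 60]
dists = load_error_distributions(samples, errors, bin_width=50)
for start, (l, f, g) in dists.items():
    print(start, round(l, 4), round(f, 4), round(g, 4))
```

In this made-up data more errors are logged in the low-load bin simply because the system is usually there, yet g(x) is larger in the high-load bin, which is the distinction the automobile analogy illustrates.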
Figures 7 and 8 depict the ℓ, f, and g distributions
of System CPU (SYSCPU) and IO, and Total CPU
(TOTCPU), for 1981.

As a general observation we note that, where the
difference between ℓ(x) and f(x) is considerable,
we might expect to see a workload dependency in the
errors. If ℓ(x) and f(x) are similar, the relationship
is probably not significant. A g(x) distribution
weighted in favor of higher workload values
will clearly generate a higher risk of an
error if the load increases.


It would appear from the g(x) plots for SYSCPU
and IO that higher values of these measures (> 50
for IO) contribute more significantly to permanent
errors than the lower values. Examining the plots
for TOTCPU we note that, as measured by CPU utilization,
the system was heavily loaded most of the
time. The ℓ(x) and g(x) plots for TOTCPU show considerable
similarity. It would therefore appear
from this cursory analysis that permanent errors
are not induced by higher execution rates, as measured
by CPU usage alone.

In order to quantify this effect, in particular
to determine exactly the risk or "hazard" associated
with higher workload values, we employed what
we refer to as a "load hazard" model, the development
and application of which is discussed in the
next section.


THE LOAD HAZARD MODEL

The object of the analysis was to determine:

1. Does a higher level of system utilization
result in a higher risk of a permanent error
than a lower level?

2. Is the relationship linear with the workload
variables, or is there a nonlinear increasing
effect?

In practical terms, if such an effect exists, it is
expected that the load will act as a stress factor.
For this purpose we developed and validated a
load-hazard model which formed the basis for our
tests. A detailed description of the development
and validation of this model appears in [Iyer 82b].
Briefly, an inherent load hazard z(x) is defined as

$$z(x) = \frac{\Pr\{\text{error in load interval } (x,\, x+\Delta x)\}}{\Pr\{\text{no error in load interval } (0,\, x)\}} \tag{1}$$
In close analogy with the classical hazard
rate in reliability theory [Shooman 66], z(x) measures
the incremental risk involved in increasing
the workload from x to x + Δx* (e.g. if the system is
currently operating at 50 percent of full load, as
measured by CPU usage, what is the increase in the
risk of a permanent error if the load is increased
to 90 percent?).

The numerator of z(x) was determined from g(x).
The survival probability in the denominator (i.e.
the probability of no permanent errors in the load
interval (0, x)) was for practical purposes found to
be very close to the probability of reaching a
given workload or higher (determined from the workload
distribution ℓ(x)). This is simply due to the
fact that, in our data, error events are much fewer
than the five minute workload samples. Consequently,
most often, when a given workload is
reached no error has occurred (i.e. permanent
errors are quite infrequent).

If z(x) increases with x, it should be clear
that there is an increasing risk of a permanent
error as the workload variable increases. If, however,
z(x) remains constant for increasing x, we
may surmise that no increased risk is involved.
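
The calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the validated estimator of [Iyer 82b]: the function name and the three-bin data are hypothetical, the numerator is taken to be g(x) for each bin, and the survival probability is replaced by the probability of reaching a given workload or higher, as the text describes.

```python
def load_hazard(l, g):
    """Approximate z(x) per load bin, with bins ordered by increasing load.

    l: load distribution per bin (sums to 1).
    g: conditional error probability per bin.
    The denominator Pr{no error in (0, x)} is approximated by
    Pr(load >= x), which is reasonable only when errors are rare.
    """
    if len(l) != len(g):
        raise ValueError("l and g must cover the same bins")

    # Tail sums: reach[i] = Pr(load >= bin i), built from the top bin down.
    reach = []
    running = 0.0
    for l_i in reversed(l):
        running += l_i
        reach.append(running)
    reach.reverse()

    return [g_i / r_i for g_i, r_i in zip(g, reach)]


# Hypothetical three-bin example (low, medium, high load).
l = [0.5, 0.3, 0.2]
g = [0.01, 0.01, 0.04]
z = load_hazard(l, g)
print([round(v, 4) for v in z])
```

With these made-up inputs z rises from bin to bin, the pattern that would indicate an increasing risk of a permanent error with load; a flat z would indicate no load effect.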

Note that in our definition of load hazard we
have removed the variability of system load by
using the conditional probability g(x). This of
course is not true in practice since load is best
described as a random variable with a probability
distribution; it is simply the associated load distribution,
ℓ(x), defined above. In order to determine
the hazard for a particular load pattern, we
must multiply the associated load probability by
the hazard calculated in (1). Denoting by z*(x)
the transformed hazard, we have

$$z^{*}(x) = z(x)\,\ell(x) \tag{2}$$


* In applying the load hazard model to our data we
made a simplifying assumption that the workload
monotonically increases until an error occurs.
This is a conservative assumption which was made
primarily to simplify some cumbersome aspects of
the data analysis. It has the additional advantage
of allowing us to estimate a lower bound on
the workload related risk (if any). This is due
to the fact that under the assumption of a monotonically
increasing workload, factors such as
cycling (between low and high usage) and other
random variations are ignored. It is well known
that such stresses only serve to add to the hazard
rate [Kujowski 76], [Arsenault 60]. Thus by
neglecting them we underestimate the hazard being
measured.
We refer to the hazard z(x), as defined in (1),
as the fundamental hazard. This is because it can
be thought of as an inherent property of a particular
system and is not subject to varying load patterns.
When a varying load pattern is taken into
account, it can be thought of as "picking out"
aspects of the fundamental hazard function. This
hazard z*(x) defined in (2) will be referred to as
the apparent hazard, since it is closely dependent
on the load distribution.
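
A small numerical sketch may help separate the two notions. It applies relation (2) to one fundamental hazard under two different load patterns; the function name and all numbers are hypothetical and are not taken from the SLAC data.

```python
def apparent_hazard(z, l):
    """Relation (2): apparent hazard = fundamental hazard * load probability."""
    if len(z) != len(l):
        raise ValueError("z and l must cover the same bins")
    return [z_i * l_i for z_i, l_i in zip(z, l)]


# One hypothetical fundamental hazard over three load bins (low, medium, high).
z = [0.01, 0.02, 0.20]

# Two hypothetical load patterns for the same system.
mostly_idle = [0.7, 0.2, 0.1]
mostly_busy = [0.1, 0.2, 0.7]

idle_hazard = apparent_hazard(z, mostly_idle)
busy_hazard = apparent_hazard(z, mostly_busy)
print([round(v, 4) for v in idle_hazard])
print([round(v, 4) for v in busy_hazard])
```

The fundamental hazard is identical in both runs; only the load pattern changes, and the busy pattern "picks out" the high-load part of z(x) far more heavily than the idle one does.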

HAZARD PLOTS

The generation of the hazard plots and associated
statistics involved extensive data processing. In
each hazard plot, z(x) or z*(x) is calculated and
plotted as a function of a chosen workload variable,
x. The permanent errors which generate the
plots occur due to a number of causes; examples
are: temperature, humidity, random noise, mechanical
failures, and design errors, some of which are
unrelated to our study. Those factors not related
to load can be expected to behave as noise in a
load-error analysis. If these other factors are
predominant, we can expect to find no discernible
pattern in our hazard plots, i.e. they should
appear as uncorrelated clouds. This is well understood
in any statistical study of dependencies.

An easily discernible pattern, on the other
hand, would indicate that the load-error dependency
dominates others. The strength of such a relationship
can be measured through regression. Figures
10, 11, and 12 depict the hazard plots for the
three selected load parameters. The regression
coefficient R², which is an effective measure of
the goodness of fit, is provided for each plot.
Quite simply, it measures the amount of variability
in the data that can be accounted for by the
regression model. R² values of greater than 0.6
(corresponding to an R > 0.75) are generally
interpreted as strong relationships* [Younger 79].
It can be seen that the hazards are increasing with
each of the load parameters shown. The relationship
is particularly strong with SYSCPU, IO and
EXCP, although other measures such as SVC and PROG
(plots not shown) also correlate strongly. Note
that these variables measure the interactive workload
with some degree of overlap and have different
degrees of variability. TOTCPU, a general
measure of execution, also correlates moderately
strongly. In addition, it is seen that the workload-error
relationship is highly non-linear. This
appears to point toward the existence of a
threshold beyond which the risk of a permanent
error rises very rapidly.
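
For readers who want to reproduce the goodness-of-fit measure, the sketch below computes R² for a straight-line least-squares fit in plain Python. The form of the regression model used for the hazard plots is not stated here, so the straight line is an assumption made only for illustration, and the function name and data points are hypothetical.

```python
import math


def r_squared(xs, ys):
    """R^2 of an ordinary least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical hazard values that grow non-linearly with load.
load = [0, 1, 2, 3]
hazard = [0, 1, 4, 9]
r2 = r_squared(load, hazard)
print(round(r2, 4), round(math.sqrt(r2), 4))

# The R value that corresponds to the R^2 = 0.6 threshold quoted above.
r_at_threshold = math.sqrt(0.6)
print(round(r_at_threshold, 4))
```

For a straight-line fit, R² is the square of the correlation coefficient R, which is why a threshold on one translates into a threshold on the other.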

It is interesting to note that most of the estimated
permanent errors were failures in main memory.
An analysis of these errors by time of day
showed that they generally occur during the period
when the main memory access rate and the interactive
workload measures (e.g. SYSCPU and IO) are the
highest (i.e. during prime time). This is shown in
Fig. 13, which gives both the permanent errors and
the average I/O rate by hour of day.


* The range of |R| from 0 to 1 is typically divided
as follows: (0, 0.25) moderately weak; (0.25,
0.5) moderate; (0.5, 0.75) moderately strong;
(0.75, 1.0) strong.


DISCUSSION OF RESULTS

The analysis shows that there is a strong load
dependency of permanent CPU errors at SLAC.
Strictly speaking the data refers only to the manifestation
of a CPU related error, i.e. the observation
of the error and not its occurrence. It is,
however, possible to estimate the average latency
of an error, say in main memory, from the measured
values of the paging rate. Using the lowest values
of the measured paging rate, it is estimated that
the time for most page frames in main memory to
incur a page transfer is about twenty minutes. The
time required to produce a change in the workload
measures significant enough to affect our results is
about an hour. Hence, the latency time is insignificant
when compared with the time required to
produce measurable change in the workload.
Accordingly, within the sensitivity of our data, the
observation of a permanent error almost coincides
with its occurrence. This observation is also confirmed
by studies on fault latency reported in
[Uala 83]. These studies found that the latency
time of detectable errors was very short indeed.
Most of the undetectable errors were in remote
locations or had "don't care" conditions.

A preliminary examination of the semiconductor
device literature shows that some experimental and
quantitative evidence exists to support our
results. For example, the effect of transient and
intermittent loading on the rating of power devices
has been studied at length; see [Ivalo 61] and
[Blackburn 74] for details. It is well known that
the duty cycle of the input pulses is an important
parameter in determining the rating and the lifetime
of such devices for pulsed operation. [Owen
60] describes practical methods commonly employed
to evaluate the thermal effects of repetitive
pulsed loading. Detailed analytical and experimental
analysis of both steady state and transient
thermal behavior is discussed in [Newell 75].

There is also evidence in the general reliability
literature which relates low and high usage
rates of avionic and navigational equipment with
corresponding reliability behavior; see [Shurman
75] and [Kujowski 76] for details. It is to be
noted that in each of these two studies a significant
component of the system was electronic or
digital. Our measurements show that the effect is
not negligible in smaller devices.

CONCLUDING REMARKS

It has been the purpose of this paper to describe
the measurement and analysis of permanent CPU
related errors and system activity at the Stanford
Linear Accelerator Center computation facility.
Between 13 and 18 percent of all errors affecting
the CPU were estimated to be permanent. The manifestation
of a permanent error was found to be
strongly correlated with the level and type of
workload prior to the manifestation of the error.
For example, it is shown that the risk of a permanent
error increases in a non-linear fashion with
the amount of interactive processing. The observed
tendency is present in three years of load data.
This observation is significant because a load-error
relationship found at the CPU level must, in
our view, be considered fundamental. In addition,
in a majority of the observed errors, the latency
between the occurrence and the manifestation of the
error was estimated to be insignificant for the
purposes of our analysis. Thus the detection of
the error also provides an estimate of the occurrence
of the error.

As with any statistical analysis, this is not
proof in itself. More measurements and experiments
are necessary to further study this problem. However,
the increasing body of evidence accumulated
on different computers with differing load and
failure patterns shows that workload should be considered
as a factor relating to reliability. Workload
can be thought of as a stress on the system,
with greater stresses resulting in greater risk of
failure. In view of our previous results, we
believe that the error process which ensues is composed
of two separate effects. The first is the
(constant) inherent failure rate. This is determined
through classical reliability techniques
[Shooman 65], taking into consideration such factors
as topology, redundancy, etc. The second is
the utilization-induced failure rate. This rate is
dependent upon both the absolute level of system
utilization and the rate of change of that level.
By an absolute level we mean an obviously measurable
level, e.g., CPU utilization, memory occupancy,
etc. Through the rate of change of utilization we
are attempting to measure the rate at which
transitions occur between various system states,
e.g. the transitions of the CPU into and out of
the busy state. In most cases the effect of this
stress is not permanent, since most errors are
transient [Iyer 82b]. However, as demonstrated in
this paper, there is a significant contribution due
to permanent errors in the CPU and main storage.

The  design  of  computer  systems  will  be  greatly 
aided  if  this  type  of  analysis  can  help  uncover 
cause  and  effect  relationships  in  permanent  errors. 